ABSTRACT

For polar codes with short-to-medium code length, list successive cancellation decoding is used to achieve a good error-correcting performance. However, list pruning in current list decoders is based on a sorting strategy, and its timing complexity is high. This results in a long decoding latency for large list sizes. In this work, aiming at a low-latency list decoding implementation, a double thresholding algorithm is proposed for fast list pruning. As a result, with a negligible performance degradation, the list pruning delay is greatly reduced. Based on the double thresholding, a low-latency list decoding architecture is proposed and implemented using a UMC 90 nm CMOS technology. Synthesis results show that, even for a large list size of 16, the proposed low-latency architecture achieves a decoding throughput of 220 Mbps at a frequency of 641 MHz.

Index Terms: Polar codes, list decoding, successive cancellation decoding, low latency, VLSI implementation

1. INTRODUCTION 

Successive cancellation decoding (SCD) is proposed in [1] for decoding polar codes, and its hardware implementation is extensively studied in [2]-[7]. However, for polar codes with short-to-medium code length, the error-correcting performance of the SCD is unsatisfactory. To improve the performance, SCDs with multiple codeword candidates are proposed. They are the list decoding [?] and its variants [?]. For a better performance, a cyclic redundancy check (CRC) code is serially concatenated with polar codes and the CRC bits are used to choose the valid codeword from the list candidates [?]. As a result, the list decoding of polar codes achieves or even exceeds the performance of Turbo codes [?] and LDPC codes [?]. However, this performance improvement comes at the cost of a larger list size (e.g., 16 or 32), and the increased complexity calls for an efficient list decoding architecture. In this work, the efficient and low-latency implementation of the list decoding is explored, aiming at promoting polar codes as a competitive coding candidate in both error-correcting and implementation aspects.

The first list decoding architecture for polar codes is proposed in [?]. In [?], the pre-computation look-ahead technique [?] is used in the list decoding for a lower latency, while its memory size is tripled. In [?] and [?], small list sizes of 4 and 2 are used, respectively. When the list decoder decodes an information bit, the number of codeword candidates is doubled. To maintain a reasonable decoding complexity, once the number of candidates exceeds the specified list size $\mathcal{L}$, some of the codeword candidates have to be pruned. The common pruning strategy is to sort the codeword candidates based on their metrics and keep the $\mathcal{L}$ best of them. However, the sorting operation incurs a large hardware and timing complexity, especially when $\mathcal{L}$ is large. In [?], a list decoding architecture with a list size of 8 is proposed, and a Bitonic sorting network is customized for efficient sorting. Nevertheless, up to three pipeline stages are used by the sorting architecture. As a result, to implement the list decoding with a large list size in hardware, the list pruning architecture is critical, especially to achieve a low decoding latency.

In this work, the list pruning architecture is optimized at both the algorithmic and architectural levels. Recently, instead of using the log-likelihood (LL) to capture the metric of the list candidates, the LL ratio (LLR) representation is used for the list decoding [?]. Benefiting from the numerical accuracy and stability of the LLR, a small and regular architecture of the memory and processing element (PE) can be used for the list decoding [?]. Therefore, in this work, the LLR is used in the design of the low-latency pruning architecture of the list decoder. Very recently in [?], borrowing some advanced techniques used in the SCD implementation [?], the special constituent codes of polar codes are utilized to reduce the latency of the list decoding. However, conventional sorting strategies are still used for their list pruning and this limits the latency reduction.

2. RELATION TO PRIOR WORK 

The main contributions of this work are outlined as follows: 

1. Different from the previous works on list decoding [?], a double thresholding strategy (DTS) is proposed to replace the sorting strategy for list pruning.

2. At the architectural level, the architectures for the DTS and the threshold value update are proposed. As a result, even for a large list size, the logic delay of list pruning is very small.

3. A low-latency list decoding architecture for a large list size, i.e., 16, is implemented in the UMC 90 nm CMOS technology. Its decoding latency is even smaller than that of the list-size-8 design in [?].

3. LIST DECODING OF POLAR CODES 

A length $N = 2^n$ polar code with rate $R = K/N$ is specified by the generator matrix $G_N$ and a frozen set $\mathcal{A}^c \subset \{0, 1, \ldots, N-1\}$ of cardinality $|\mathcal{A}^c| = N - K$. A source word of polar codes is denoted as $u_N$, and $u_N \in \{0,1\}^N$. It consists of $K$ information bits $u_i$ ($i \notin \mathcal{A}^c$) and $N - K$ frozen bits $u_i$ ($i \in \mathcal{A}^c$). The information bits are used to deliver the data, while each frozen bit is set to a value, e.g., 0, pre-known by the decoder. If an $r$-bit CRC is used, the last $r$ information bits take the CRC of the previous $K - r$ bits. In the encoder, the codeword $x_N \in \{0,1\}^N$ is generated as $x_N^T = u_N^T G_N$ and sent over the physical channel.
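As a minimal sketch of the encoding step described above, the Python snippet below places the information bits outside the frozen set and applies the polar transform. It assumes $G_N = F^{\otimes n}$ with $F = [[1, 0], [1, 1]]$ and no bit-reversal permutation, which the text does not specify. The function names and the example frozen set are hypothetical, and CRC attachment is omitted.

```python
def build_source_word(info_bits, frozen_set, N):
    """Place information bits at non-frozen indices; frozen bits are set to 0."""
    bits = iter(info_bits)
    return [0 if i in frozen_set else next(bits) for i in range(N)]


def polar_encode(u):
    """Compute x = u * G_N over GF(2) with XOR butterflies.

    Assumption: G_N is the n-fold Kronecker power of F = [[1, 0], [1, 1]],
    without a bit-reversal permutation.
    """
    x = list(u)
    N = len(x)
    step = 1
    while step < N:
        for start in range(0, N, 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]  # upper branch absorbs the lower branch
        step *= 2
    return x


# Example with N = 4 and a hypothetical frozen set {0}
u = build_source_word([1, 0, 1], frozen_set={0}, N=4)
x = polar_encode(u)
```

With these inputs, `u` is `[0, 1, 0, 1]` and `x` is `[0, 0, 1, 1]`.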

Let $y$ be the noise-corrupted signal of $x_N$ at the receiver. The LLRs input to the decoder are given as

$$L_i^0 = \log\left[\Pr(y \mid x_i = 0)\right] - \log\left[\Pr(y \mid x_i = 1)\right] \tag{1}$$

for $i = 0, 1, \ldots, N-1$. The decoding process of polar codes can be illustrated by two trees: the decoding tree and the scheduling tree. Fig. 1 shows an example of the trees for $N = 4$. The decoding tree of a length-$N$ polar code is a depth-$N$ binary tree, with $u_i$ mapped to the nodes at depth $i+1$. Its root node represents a null state. A path from the root node to a depth-$i$ node represents a sub-vector $[u_0, u_1, \ldots, u_{i-1}]$ of the source word $u_N$, and is named the decoding path $p^i$. Specifically, a path from the root node to a leaf node of the decoding tree represents a source word $u_N$ of polar codes, and the value of each bit of $u_N$ is shown in the corresponding node lying on this decoding path. Notice that, if $u_i$ is a frozen bit, it only assumes 0. Hence, the right sub-tree rooted at the depth-$(i+1)$ node can be pruned, as the source words included in it are not valid. For example, if $\mathcal{A}^c = \{0\}$, the gray sub-tree in Fig. 1(a) is pruned.

Fig. 1: List decoding example of polar codes with $N = 4$ and $\mathcal{L} = 2$. (a) Decoding tree and decoding path. (b) Scheduling tree for SCD.

Decoding the polar codes can be treated as a search problem in the pruned decoding tree. The conventional SCD performs a depth-first search. Given a partial decoding path $[u_0, u_1, \ldots, u_{i-1}]$, the SCD generates the LLR of bit $u_i$, denoted as $L_i^n$. If $i \in \mathcal{A}^c$, $u_i$ is decoded as $\hat{u}_i = 0$ irrespective of $L_i^n$. Otherwise, an ML decision is made for the information bit $u_i$ ($i \notin \mathcal{A}^c$) and is given by

$$\hat{u}_i = \delta(L_i^n) = \begin{cases} 0 & L_i^n \ge 0 \\ 1 & L_i^n < 0 \end{cases} \tag{2}$$
Based on the decision rule in (2), a single decoding path from the root to the leaf is obtained in the decoding tree, e.g., the red path in Fig. 1(a), and it is the source word $\hat{u}_N$ decoded by the SCD.

For a better error-correcting performance, a breadth-first search is performed by the list decoding. To constrain the search complexity, a list size $\mathcal{L}$ is set. Let the $\mathcal{L}$ decoding paths at depth $i$ of the decoding tree be denoted as $p_l^i = [u_0^l, u_1^l, \ldots, u_{i-1}^l]$, $l = 0, 1, \ldots, \mathcal{L}-1$. For each path candidate $p_l^i$, a path metric is associated with it and denoted as $pm_l^i$. When decoding the information bit $u_i$, the $\mathcal{L}$ decoding paths are extended to $2\mathcal{L}$ paths. From [?], the path metrics of the two extensions of the path $p_l^i$ are given by

$$pm_l^{i+1}(u_i) = pm_l^i + \log\left(1 + e^{-(1 - 2u_i) L_i^n}\right) \tag{3}$$

where $u_i$ assumes 0 and 1, corresponding to the left and right extensions of $p_l^i$. In the hardware [?], (3) is approximated by

$$pm_l^{i+1}(u_i) = \begin{cases} pm_l^i & \text{if } u_i = \delta(L_i^n) \\ pm_l^i + |L_i^n| & \text{if } u_i \ne \delta(L_i^n) \end{cases} \tag{4}$$

The operation in (4) is denoted as path metric update (PMU). Based on the $2\mathcal{L}$ pms from the PMU, the $\mathcal{L}$ extended paths with the smallest pms are chosen and they are the paths at depth $i+1$, i.e., $p_l^{i+1}$, $0 \le l < \mathcal{L}$. This operation is named the list pruning operation (LPO).

From (3) or (4), it can be seen that the PMU needs the knowledge of $L_i^n$, which is generated by the SCD. The SCD operation can be described by the scheduling tree shown in Fig. 1(b). The scheduling tree of a length-$N$ polar code is a depth-$n$ balanced binary tree. It consists of two kinds of nodes: $f$ nodes and $g$ nodes. The functions included in one node can be evaluated in one clock cycle. Generally, the depth-first traversal of the scheduling tree completes decoding one codeword, and the $i$-th leaf node outputs the LLR $L_i^n$. In the list decoding, $\mathcal{L}$ SCDs are deployed and executed in parallel. When they reach a leaf node, the SCDs are stalled and the PMU calculates $2\mathcal{L}$ path metrics from the $\mathcal{L}$ values of $L_i^n$ with (4). After that, the LPO chooses the best $\mathcal{L}$ decoding paths. The SCDs cannot be restarted until the LPO finishes, because the subsequent SCD operations need the knowledge of the updated decoding paths. Therefore, the delays of the PMU and the LPO result in an increased latency of the list decoding.

From (4), the PMU is implemented with an adder array and its logic delay is small. However, a sorting strategy is used for the LPO in the conventional list decoding architectures [?]. For a shorter delay, a parallel sorting architecture [?] is used in [?] and [?]. However, its hardware complexity is $O(\mathcal{L}^2)$, and hence it becomes inefficient for large $\mathcal{L}$. On the other hand, a Bitonic sorting network is used in [?], and its delay also scales with $\mathcal{L}$. Next, to achieve a short-delay LPO and hence a low-latency decoder, a double thresholding algorithm and its corresponding architecture are proposed.

4. LOW-LATENCY LIST DECODING IMPLEMENTATION
4.1. Double Thresholding Strategy

In this sub-section, a list pruning strategy with small logic delay is introduced. Based on the $2\mathcal{L}$ path metrics from the PMU, it approximately finds the $\mathcal{L}$ smallest pms and their corresponding path extensions. To achieve this, the properties of the $2\mathcal{L}$ path metrics in (4) are first studied and presented in the following proposition.

Fig. 2: Double thresholding strategy

Proposition 1. Assume the $\mathcal{L}$ path metrics at depth $i$ of the decoding tree are sorted as

$$pm_0^i < pm_1^i < \cdots < pm_j^i < pm_{j+1}^i < \cdots < pm_{\mathcal{L}-1}^i, \tag{5}$$

and extended to $2\mathcal{L}$ path metrics with (4). If the subset of the extended path metrics $pm_l^{i+1}(u_i)$ is defined as

$$\Omega(T) = \left\{ pm_l^{i+1}(u_i) \;\middle|\; pm_l^{i+1}(u_i) < T \right\}, \tag{6}$$

then the cardinality of $\Omega(T)$ for $T = pm_j^i$ satisfies

$$j \le |\Omega(pm_j^i)| \le 2j. \tag{7}$$
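The path metric update and the cardinality bound of Proposition 1 can be checked numerically with a short Python sketch. It assumes the usual LLR-based form of the exact update, a separate leaf LLR for every path, and strictly increasing metrics. The function names and the example values are hypothetical and serve only as an illustration.

```python
import math


def delta(llr):
    """Hard decision of (2): 0 for a non-negative LLR, 1 otherwise."""
    return 0 if llr >= 0 else 1


def pmu_exact(pm, llr, u):
    """Exact LLR-based path metric update, as in (3)."""
    return pm + math.log1p(math.exp(-(1 - 2 * u) * llr))


def pmu_approx(pm, llr, u):
    """Approximate update of (4): only the unlikely extension is penalised."""
    return pm if u == delta(llr) else pm + abs(llr)


def omega_size(sorted_pms, llrs, j):
    """Cardinality of the set in (6) for T = sorted_pms[j], using (4)."""
    threshold = sorted_pms[j]
    extended = [pmu_approx(pm, llr, u)
                for pm, llr in zip(sorted_pms, llrs) for u in (0, 1)]
    return sum(1 for pm in extended if pm < threshold)


# Hypothetical sorted metrics and per-path leaf LLRs for a list size of 4
pms = [0.0, 1.0, 2.0, 3.0]
llrs = [2.5, -0.5, 4.0, -0.2]
sizes = [omega_size(pms, llrs, j) for j in range(len(pms))]
```

For this example `sizes` is `[0, 1, 3, 5]`, so every entry lies between `j` and `2 * j` as stated in (7).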

Due to the space limitation, the proof of Proposition 1 is not shown. Based on Proposition 1, the Double Thresholding Strategy (DTS) for list pruning is given as follows.

Double Thresholding Strategy. Assume the $\mathcal{L}$ path metrics at depth $i$ of the decoding tree follow (5). To prune the $2\mathcal{L}$ path extensions at depth $i+1$, two thresholds, i.e., the Acceptance Threshold (AT) and the Rejection Threshold (RT), are defined and set as

$$[\mathrm{AT}, \mathrm{RT}] = \left[ pm_{\mathcal{L}/2}^i,\; pm_{\mathcal{L}-1}^i \right]. \tag{8}$$

The path extensions at depth $i+1$ obey the following pruning rule:

1. if $pm_l^{i+1}(u_i) < \mathrm{AT}$, the path extension is kept;

2. if $pm_l^{i+1}(u_i) > \mathrm{RT}$, the path extension is pruned;

3. path extensions with $\mathrm{AT} \le pm_l^{i+1}(u_i) \le \mathrm{RT}$ are randomly chosen such that the list size remains $\mathcal{L}$.

Fig. 3: Threshold tracking architecture
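The pruning rule DTS.1 to DTS.3 can be sketched in software as follows. This is only an illustration under stated assumptions: every path has its own leaf LLR, the thresholds are obtained by a plain sort rather than by dedicated hardware, and the random choice of DTS.3 is replaced by a first-come selection in index order. The function name `dts_prune` and its `rt_rank` parameter are hypothetical.

```python
def delta(llr):
    """Hard decision of (2): 0 for a non-negative LLR, 1 otherwise."""
    return 0 if llr >= 0 else 1


def dts_prune(pms, llrs, rt_rank=None):
    """Prune the 2L path extensions to at most L with double thresholding.

    pms     -- the L path metrics at depth i (any order)
    llrs    -- the leaf LLR of each path (assumed to be per path)
    rt_rank -- rank of the metric used as RT; defaults to L - 1 as in (8)
    Returns a list of (path index, bit value, extended metric) tuples.
    """
    L = len(pms)
    ranked = sorted(pms)
    at = ranked[L // 2]
    rt = ranked[L - 1 if rt_rank is None else rt_rank]

    kept, undecided = [], []
    for l, (pm, llr) in enumerate(zip(pms, llrs)):
        for u in (0, 1):
            ext = pm if u == delta(llr) else pm + abs(llr)  # update of (4)
            if ext < at:
                kept.append((l, u, ext))        # DTS.1: accept
            elif ext <= rt:
                undecided.append((l, u, ext))   # DTS.3: candidate
            # ext > rt: rejected by DTS.2

    # DTS.3: fill the list in index order (stand-in for the random choice)
    return kept + undecided[:L - len(kept)]


# Hypothetical metrics and LLRs for a list size of 4
survivors = dts_prune([0.0, 1.0, 2.0, 3.0], [2.5, -0.5, 4.0, -0.2])
tight = dts_prune([0.0, 1.0, 2.0, 3.0], [2.5, -0.5, 4.0, -0.2], rt_rank=2)
```

In this example the default RT keeps the extension with metric 2.5 instead of the one with metric 2.0, so the pruned list differs from the exactly sorted one, while the tighter threshold in `tight` recovers the four smallest metrics.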

Fig. 2(a) illustrates the DTS for the LPO. Assume the $2\mathcal{L}$ extended pms are sorted and the top path extension has the smallest path metric. If the list is exactly pruned, the top $\mathcal{L}$ path extensions will be the decoding paths at depth $i+1$. However, when the DTS is used, the shaded paths are reserved for depth $i+1$. As shown in Fig. 2(a), from Proposition 1, DTS.1 ensures that at least the $\mathcal{L}/2$ best decoding paths are kept. Moreover, the number of the reserved paths does not exceed $\mathcal{L}$. On the other hand, since $|\Omega(pm_{\mathcal{L}-1}^i)| \ge \mathcal{L} - 1$, DTS.2 efficiently excludes the path extensions that are definitely not in the set of the $\mathcal{L}$ best paths. Finally, when the number of the paths kept by DTS.1 is smaller than $\mathcal{L}$, DTS.3 will fill up the $\mathcal{L}$ path candidates. Notice that the number of the paths pruned by DTS.2 is no greater than $\mathcal{L}$. Therefore, DTS.3 is always able to fill up the decoding list.

From Fig. 2(a), the performance degradation of the DTS is due to DTS.3. If RT is loose, some decoding paths belonging to the $\mathcal{L}$ best paths may not be chosen by DTS.3. To alleviate this, a tighter (smaller) RT can be assumed. For example, the value of RT in (8) can be replaced by $pm_k^i$ ($k < \mathcal{L} - 1$). As shown in Fig. 2(b), by doing so, the number of the candidates that DTS.3 can choose from decreases, and hence the probability that a chosen decoding path belongs to the $\mathcal{L}$ best paths increases. However, from (7), when $\mathrm{RT} = pm_k^i$ ($k < \mathcal{L} - 1$), it is possible that more than $\mathcal{L}$ decoding paths will be pruned by DTS.2. As a result, DTS.3 is not always able to fill up the $\mathcal{L}$ path candidates, as depicted in Fig. 2(c). Hence, if the RT value is too small, the performance will become poor, as the decoding paths are aggressively pruned. Therefore, an optimal value of RT exists.

Finally, from the hardware implementation perspective, the complexity of the DTS is much smaller than that of the conventional sorting strategy. To implement DTS.1 and DTS.2, $4\mathcal{L}$ comparators are sufficient and all the comparison operations can be executed in parallel, as the pms are compared with the same fixed threshold values. To implement DTS.3, circuits based on the priority encoder are used. Most importantly, due to the parallel nature of the DTS, the logic delay of the DTS is much shorter than that of the full sorting strategy. As a result, the PMU together with the DTS can be finished in one clock cycle.



Fig. 4: Block diagram of the low-latency list decoding architecture 


4.2. Threshold Tracking Architecture 

To support the DTS block, the values of AT and RT are needed. These values are calculated by the Threshold Tracking Architecture (TTA) shown in Fig. 3. From Section 4.1, the AT and RT used at depth $i+1$ of the decoding tree depend on the path metrics at depth $i$. Therefore, the TTA can be executed in parallel with the list decoding while the paths are extended from depth $i$ to $i+1$. This leads to a relaxed timing budget for the TTA, and it can be executed in multiple cycles.

From (8), the TTA finds the median and the maximum values of the $\mathcal{L}$ input numbers. Finding the median is more complicated and its implementation is based on the following property of the medians.

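Proposition 2. Assume $W$ numbers $\{w_0, w_1, \ldots, w_{W-1}\}$ satisfy the following properties:

$$\begin{aligned} w_0 &< w_1 < \cdots < w_{m_0} < \cdots < w_{W/2-1} \\ w_{W/2} &< w_{W/2+1} < \cdots < w_{m_1} < \cdots < w_{W-1} \end{aligned} \tag{9}$$

where $w_{m_0}$ and $w_{m_1}$ are the medians of $\{w_0, \ldots, w_{W/2-1}\}$ and $\{w_{W/2}, \ldots, w_{W-1}\}$, respectively. If the median of $\{w_0, w_1, \ldots, w_{W-1}\}$ is denoted as $w_m$, then

$$\begin{cases} w_m \in \{w_{m_0}, \ldots, w_{W/2-1}, w_{W/2}, \ldots, w_{m_1}\} & w_{m_0} < w_{m_1} \\ w_m \in \{w_{m_1}, \ldots, w_{W-1}, w_0, \ldots, w_{m_0}\} & w_{m_0} > w_{m_1} \\ w_m = w_{m_0} = w_{m_1} & w_{m_0} = w_{m_1} \end{cases} \tag{10}$$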

Proposition 2 can be recursively used to find the value of AT. Fig. 3(a) shows the corresponding architecture for $\mathcal{L} = 8$. It consists of two radix-$\mathcal{L}/2$ sorters [?], $\mathcal{L} - 1$ MUXes, and $\log_2 \mathcal{L}$ comparators. As shown in Fig. 3(a), the $\mathcal{L}$ path metrics are evenly divided into two groups and passed through the radix-$\mathcal{L}/2$ sorters. As a result, the metrics in each group are sorted, i.e., $w_0 < w_1 < w_2 < w_3$ and $w_4 < w_5 < w_6 < w_7$. From Proposition 2, by comparing $w_2$ and $w_6$, the size of the median candidate set is halved. In Fig. 3(a), the comparison result of $w_2$ and $w_6$ controls the 4 MUXes at stage 2 and they choose $\{w'_0, w'_1, w'_2, w'_3\}$ based on (10). Moreover, $w'_0 < w'_1$ and $w'_2 < w'_3$. Hence, similar comparison and MUX architectures can be used for the following stages. As a result, after $\log_2 \mathcal{L}$ stages, the median of the inputs, i.e., AT for the next depth, is obtained.
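The candidate selection of Proposition 2 can be illustrated with a small Python sketch. It assumes that the median of a group of even size is its element of rank half the group size (0-based), which matches the choice of AT in (8). The function name and the example metrics are hypothetical, and the sketch covers a single selection step rather than the full MUX-based circuit.

```python
def median_candidates(first_half, second_half):
    """Candidate set of (10) for the median of two sorted groups.

    Both groups are sorted in ascending order and have the same even length.
    The median of a group is its element at index len // 2 (an assumption
    chosen to match AT in (8)).
    """
    h = len(first_half)
    m0, m1 = first_half[h // 2], second_half[h // 2]
    if m0 < m1:
        return first_half[h // 2:] + second_half[:h // 2 + 1]
    if m0 > m1:
        return second_half[h // 2:] + first_half[:h // 2 + 1]
    return [m0]


# Hypothetical path metrics for a list size of 8
metrics = [7, 1, 5, 3, 2, 8, 6, 4]
w_first, w_second = sorted(metrics[:4]), sorted(metrics[4:])
candidates = median_candidates(w_first, w_second)
true_median = sorted(metrics)[len(metrics) // 2]
```

Here the two group medians are 5 and 6, the candidate set is `[5, 7, 2, 4, 6]`, and the median of all eight metrics, 5, is contained in it.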

To find the value of RT, the architecture is simpler. If the maximum path metric is adopted as RT as in (8), the maximum of $w_3$ and $w_7$ in Fig. 3(a) is RT. If the second maximum path metric is taken for a tighter RT, it can be found by the architecture in Fig. 3(b).

4.3. List Decoding Architecture 

The top-level architecture of the proposed low-latency list decoder is shown in Fig. 4. It contains $\mathcal{L}$ SCDs and each SCD is implemented with a semi-parallel architecture of $M < N/2$ processing elements (PEs) [?]. Based on the values of $L_i^n$ output from the SCDs, the PMU generates $2\mathcal{L}$ pms from the $\mathcal{L}$ stored pms with (4). Out of these $2\mathcal{L}$ pms, $\mathcal{L}$ pms are chosen by the DTS and they are stored in the register of the PMU.
Fig. 5: Timing diagram of the low-latency list decoding architecture

Fig. 6: List decoding timing diagram with frozen sibling


Based on the registered pms, the TTA computes the AT and RT used in decoding the next bit $u_{i+1}$. After the DTS, the memory contents related to the SCD need to be copied as in [?]. As shown in Fig. 4, the lazy copy (LCP) block generates the control logic for them. Finally, when the list decoding reaches the leaf node of the decoding tree, the contents of the path memory are passed to the CRC check block. The source word that satisfies the CRC check is the decoding result $\hat{u}_N$.

Fig. 5 shows the timing diagram of the proposed low-latency list decoding architecture, using the example in Fig. 1 for illustration. For simplicity, assume there are already $\mathcal{L}$ decoding paths in the list at the beginning. From Fig. 5, different from the conventional SCD, two additional clock cycles are inserted after each leaf node SCD operation of the scheduling tree. As depicted in Fig. 5, they are used for the list pruning by the DTS and the memory manipulation by the LCP, respectively. As a result, the latency of decoding one codeword in terms of the number of clock cycles is given by

$$T_d = 4N + (n - 2 - \log_2 M)\,\frac{N}{M}. \tag{11}$$

Finally, Fig. 5 also shows that the TTA is not on the critical path of the list decoding. At least 2 clock cycles are available for the TTA.

4.4. Further Latency Reduction 

In this sub-section, frozen siblings are used to reduce the decoding latency. They are defined as $[u_{2j}, u_{2j+1}]$ with $\{2j, 2j+1\} \subset \mathcal{A}^c$. With $0 \le j < N/2$, a frozen sibling corresponds to a leaf sibling in the scheduling tree. For a general sibling, as shown in Fig. 5, $L_{2j}^n$ and $L_{2j+1}^n$ are sequentially evaluated based on the LLRs of their parent in the scheduling tree. However, for a frozen sibling, its path extension is fixed and given by $[u_{2j}, u_{2j+1}] = [0, 0]$. Moreover, the PMU from $pm_l^{2j}$ to $pm_l^{2j+2}$ can be simplified as

$$pm_l^{2j+2} = pm_l^{2j} + \delta(L_{2j}^n)\,|L_{2j}^n| + \delta(L_{2j+1}^n)\,|L_{2j+1}^n|. \tag{12}$$

It can be proven that (12) is equivalent to (4) for the frozen sibling, and 5 clock cycles can be saved by (12). For example, if $[u_0, u_1]$ in Fig. 1 is a frozen sibling, the timing diagram of the list decoding is shown in Fig. 6. All the decoding operations related to the frozen sibling shrink to one PMU operation (12) in one clock cycle. Therefore, the latency of the proposed list decoding is reduced to

$$T_d = 4N + (n - 2 - \log_2 M)\,\frac{N}{M} - 5\,FS, \tag{13}$$

where $FS$ is the number of frozen siblings in the given polar code.
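The latency expressions can be evaluated with a few lines of Python. The sketch below uses the implementation parameters reported in Section 5 ($N = 1024$, $M = 64$, $FS = 231$, 641 MHz) and assumes that the throughput is counted in coded bits, i.e., $N$ bits per $T_d$ clock cycles. That throughput definition and the function names are assumptions made for illustration.

```python
def latency_cycles(N, M, frozen_siblings=0, saved_per_sibling=5):
    """Clock cycles per codeword: (11) without and (13) with frozen siblings.

    N and M are assumed to be powers of two.
    """
    n = N.bit_length() - 1        # n = log2(N)
    log2_m = M.bit_length() - 1   # log2(M)
    base = 4 * N + (n - 2 - log2_m) * N // M
    return base - saved_per_sibling * frozen_siblings


def throughput_mbps(N, cycles, clock_mhz):
    """Coded throughput in Mbps, assuming N bits are decoded every `cycles` cycles."""
    return N * clock_mhz / cycles


td_plain = latency_cycles(1024, 64)                    # (11)
td_fs = latency_cycles(1024, 64, frozen_siblings=231)  # (13)
rate = throughput_mbps(1024, td_fs, 641)
```

Under these assumptions the latency drops from 4128 to 2973 clock cycles, and the resulting rate of about 220.8 Mbps is consistent with the reported throughput of 220 Mbps.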

5. EXPERIMENTAL RESULTS 

An $(N, R, r) = (2048, 1/2, 16)$ polar code is sent over the BAWGN channel and decoded with different decoders. Their frame error rate (FER) curves are shown in Fig. 7. All the list decoders use (4) for the PMU as in the hardware implementation. $\mathcal{L} = 2$, 4, 8, and 16 are simulated for the conventional list decoding with the sorting strategy [?]. As a reference, the performances of the SCD and an $(N, R) = (2304, 1/2)$ WiMAX LDPC code [30] are also presented. Here, 25 iterations are used for the LDPC decoding. It can be seen that the performance of polar codes is better than that of the LDPC code when $\mathcal{L} = 16$ list decoding is used. Finally, three DTSs are used for $\mathcal{L} = 16$. Their RTs assume $pm_{15}^i$, $pm_{14}^i$, and $pm_{13}^i$, respectively, and AT is fixed to $pm_8^i$. Fig. 7 indicates that $pm_{14}^i$ is the optimal value of RT and the performance degradation of the resulting low-latency list decoding is smaller than 0.02 dB.

Fig. 7: Performance comparison of different decoders

Table 1: Synthesis Results and Comparison

| | This Work | [?] | [?] | [?] |
|---|---|---|---|---|
| Technology | 90 nm CMOS | 90 nm CMOS | 90 nm CMOS | 90 nm CMOS |
| $N$ | 1024 | 1024 | 1024 | 1024 |
| $\mathcal{L}$ | 16 | 4 | 8 | 4 |
| $M$ | 64 | 64 | 64 | 64 |
| Area (mm²) | 7.46 | 3.53 | 8.64 | 1.743 |
| Clock Freq. (MHz) | 641 | 314 | 625 | 412 |
| Throughput (Mbps) | 220 | 124 | 177 | 162 |

The architecture shown in Fig. 4 is implemented for $\mathcal{L} = 16$ to decode $(N, R) = (1024, 1/2)$ polar codes. The quantization scheme of [?] is used, i.e., 6 bits for the channel LLR and 8 bits for the path metric. The design is synthesized using a UMC 90 nm CMOS technology, and Table 1 summarizes the synthesis results. Due to the large list size, the area of the LLR memory in our implementation is large and equals 4.5 mm². For the target polar codes, $FS = 231$ and the decoding throughput can be obtained from (13). From the table, the proposed architecture achieves a decoding throughput of 220 Mbps, which is even greater than that of the list-size-8 design in [?]. The results in Table 1 demonstrate the effectiveness of the proposed low-latency list decoding architecture with double thresholding.

6. CONCLUSION 

For a low-latency list decoding, a double thresholding strategy (DTS) is proposed for fast list pruning. With a negligible performance degradation, the DTS greatly reduces the pruning logic delay. Based on the DTS, a low-latency list decoding architecture is proposed. Comparison results demonstrate that the proposed architecture achieves a much lower latency for a large list size.